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EMPIRICAL BAYESIAN ANALYSIS OF SIMULTANEOUS CHANGEPOINTS 

IN MULTIPLE DATA SEQUENCES 

ZHOU FAN AND LESTER MACKEY 


Abstract. Copy number variations in cancer cells and volatility fluctuations in stock prices are 
commonly manifested as changepoints occurring at the same positions across related data sequences. 
We introduce a Bayesian modeling framework, BASIC, that employs a changepoint prior to capture 
the co-occurrence tendency in data of this type. We design efficient algorithms to sample from 
and maximize over the BASIC changepoint posterior and develop a Monte Carlo expectation- 
maximization procedure to select prior hyperparameters in an empirical Bayes fashion. We use the 
resulting BASIC framework to analyze DNA copy number variations in the NCI-60 cancer cell lines 
and to identify important events that affected the price volatility of S&P 500 stocks from 2000 to 
2009. 


1. Introduction 

Figure 1 displays three examples of aligned sequence data. Panel (a) presents DNA copy number 
measurements at sorted genome locations in four human cancer cell lines [Varma et al., 2014]. 
Panel (b) shows the daily stock returns of four U.S. stocks over a period of ten years. Panel 
(c) traces the interatomic distances between four pairs of atoms in a protein molecule over the 
course of a computer simulation [Lindorff-Larsen et ah, 2011]. Each sequence in each panel is 
reasonably modeled as having a number of discrete “changepoints,” such that the characteristics 
of the data change abruptly at each changepoint but remain homogeneous between changepoints. 
In panel (a), these changepoints demarcate the boundaries of DNA stretches with abnormal copy 
number. In panel (b), changepoints indicate historical events that abruptly impacted the volatility 
of stock returns. In panel (c), changepoints indicate structural changes in the 3-D conformation 
of the protein molecule. For each of these examples, it is important to understand when and in 
which sequences changepoints occur. However, the number and locations of these changepoints 
are typically not known a priori and must be estimated from the data. The problem of detecting 
changepoints in sequential data has a rich history in the statistics literature, and we refer the reader 
to [Basseville and Nikiforov, 1993, Chen and Gupta, 2012] for a more detailed review and further 
applications. 

In many modern applications, we have available not just a single data sequence but rather many 
related sequences measured at the same locations or time points. These sequences often exhibit 
changepoints occurring at the same sequential locations. For instance, copy number variations 
frequently occur at common genomic locations in cancer samples [Pollack and Brown, 1999] and in 
biologically-related individuals [Zhang et al., 2010], economic and political events can impact the 
volatility of many stock returns in tandem, and a conformational change in a region of a protein 
molecule can affect distances between multiple atomic pairs [Fan et al., 2015]. As recognized in 
many recent papers, discussed below, an analysis of multiple sequences jointly may yield greater 
statistical power in detecting their changepoints than analyses of the sequences individually. In 
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Figure 1. (a) DNA copy numbers in four cancer cell lines, indicated by fluorescence 
intensity log-ratios from array-CGH experiments, (b) Daily returns of four U.S. 
stocks, (c) Distances between four pairs of atoms in a computer simulation of a 
protein molecule. 


addition, a joint analysis may more precisely identify the times or locations at which changepoints 
occur and better highlight the locations where changepoints most frequently recur across sequences. 

Motivated by these considerations, we introduce a Bayesian modeling framework, BASIC, for 
carrying out a Bayesian Analysis of Simultaneous Changepoints. In single-sequence applications, 
Bayesian changepoint detectors have been shown to exhibit favorable performance in comparison 
with other available methods and have enjoyed widespread use [Chernoff and Zacks, 1964, Yao, 1984, 
Barry and Hartigan, 1993, Stephens, 1994, Chib, 1998, Fearnhead, 2006, Adams and MacKay, 
2007]. In Section 2, we propose an extension of Bayesian changepoint detection to the multi¬ 
sequence setting by defining a hierarchical prior over latent changepoints, which first specifies the 
sequential locations at which changepoints may occur and then specifies the sequences that contain 
a changepoint at each such location. 

Inference in the BASIC model is carried out through efficient, tailored Markov chain Monte 
Carlo (MCMC) procedures (Section 3.1) and optimization procedures (Section 3.2) designed to 
estimate the posterior probabilities of changepoint events and the maximum-a-posteriori (MAP) 
changepoint locations, respectively. These procedures employ dynamic programming sub-routines 
to avoid becoming trapped in local maxima of the posterior distribution. To free the user from 
pre-specifying prior hyperparameters, we adopt an empirical Bayes approach [Robbins, 1956] to 
automatic hyperparameter selection using Monte Carlo expectation maximization (MCEM) [Wei 
and Tanner, 1990] (Section 3.4). 

To demonstrate the applicability of our model across different application domains, we use our 
methods to analyze two different data sets. The first is a set of array comparative genomic hy¬ 
bridization (aCGH) copy number measurements of the NCI-60 cancer cell lines [Varrna et al., 2014], 
four of which have been displayed in Figure 1(a). In Section 5, we use our method to highlight 
focal copy number variations that are present in multiple cell lines; many of the most prominent 
variations that we detect are consistent with known or suspected oncogenes and tumor suppressor 
genes. The second data set consists of the daily returns of 401 U.S. stocks in the S&P 500 index 
from the year 2000 to 2009, four of which have been displayed in Figure 1(b). In Section 6, we 
use our method to identify important events in the history of the U.S. stock market over this time 
period, pertaining to the entire market as well as to individual groups of stocks. 

Comparison with existing methods: Early work on changepoint detection for multivariate 
data [Srivastava and Worsley, 1986, Healy, 1987] studied the detection of a change in the joint 
distribution of all observed variables. Our viewpoint is instead largely shaped by [Zhang et al., 
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2010], which formulated the problem as detecting changes in the marginal distributions of subsets 
of these variables. A variety of methods have been proposed to address variants of this problem, 
many with a particular focus on analysis of DNA copy number variation. These methods include 
segmentation procedures using scan statistics [Zhang et al., 2010, Siegmund et ah, 2011, Jeng 
et al., 2013], model-selection penalties [Zhang and Siegmund, 2012, Fan et al., 2015], total-variation 
denoising [Nowak et al., 2011, Zhou et ah, 2013], and other Bayesian models [Dobigeon et ah, 2007, 
Shah et ah, 2007, Harle et ah, 2014, Bardwell and Fearnhead, 2017]. Here, we briefly highlight 
several advantages of our present approach. 

Comparing modeling assumptions, several methods [Jeng et ah, 2013, Bardwell and Fearnhead, 
2017] focus on the setting in which each sequence exhibits a baseline behavior, and changepoints 
demarcate the boundaries of non-overlapping “aberrant regions” that deviate from this baseline. 
Shah et ah [2007] further assumes a hidden Markov model with a small finite set of possible signal 
values for each sequence. However, data in many applications are not well-described by these 
simpler models. For instance, in cancer samples, short focal copy number aberrations may fall 
inside longer aberrations of entire chromosome arms and overlap in sequential position, and true 
copy numbers might not belong to a small set of possible values if there are fractional gains and 
losses due to sample heterogeneity. Conversely, the Bayesian models of [Dobigeon et ah, 2007, 
Harle et ah, 2014] are very general, but their priors and inference procedures involve 2 J parameters 
(where J is the number of sequences), rendering inference intractable for applications with many 
sequences. By introducing a prior that is exchangeable across sequences, we strike a different 
balance between model generality and tractability of inference. 

Comparing algorithmic approaches, we observe in simulation (Section 4) that total-variation 
denoising can severely overestimate the number of changepoints, rendering them ill-suited for ap¬ 
plications in which changepoint-detection accuracy (rather than signal reconstruction error) is of 
interest. In contrast to recursive segmentation procedures, our algorithms employ sequence-wise lo¬ 
cal moves, which we believe are better-suited to multi-sequence problems with complex changepoint 
patterns. These local moves are akin to the penalized likelihood procedure of [Fan et ah, 2015], 
but in contrast to [Fan et ah, 2015] where the likelihood penalty shape and magnitude are ad hoc 
and user-specified, our empirical Bayes approach selects prior hyperparameters automatically using 
MCEM. Finally, the BASIC approach provides a unified framework that accommodates a broad 
range of data types and likelihood models, can detect changes of various types (e.g. in variance as 
well as in mean), and returns posterior probabilities for changepoint events in addition to point 
estimates. 


2. The BASIC Model 

Suppose X £ M JxT is a collection of J aligned data sequences, each consisting of T observations. 
The BASIC model for A is a generative process defined by three inputs: an observation likelihood 
p(-\0) parameterized by 9 E © C M d , a prior distribution 7 tq on the parameter space 0, and a 
changepoint frequency prior ttq on [0,1]. For each sequence position t, a latent variable qt. £ [0,1] 
is drawn from 7 Tq and represents the probability of any sequence having a changepoint between its 
(t — l) th and t th data points. Then, for each sequence position t and sequence j, a latent variable 
Zj t £ {0,1} is drawn with Pr [Zjt = 1 \ = qt and indicates whether there is a changepoint in se¬ 
quence j between its (t — l) th and t th data points. Finally, for each t and j, a latent likelihood 
parameter 6jj £ 0 and an observed data point Xjj are drawn, such that 6j.t remains constant 
(as a function of t) in each data sequence between each pair of consecutive changepoints of that 
sequence and is generated anew from the prior 7T0 at each changepoint, and Xjj is a conditionally 
independent draw from p(-\9j.t). This process is summarized as follows: 
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Figure 2. An illustration of the BASIC model. In this illustration, distinct values 
of 9 are drawn from 7r© = Normal(0,5), and values of X are drawn from p(-\0) = 
Normal (0,1). 
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For notational convenience, we arrange Zjj into a matrix Z G {0,1 } JxT , fixing Zj i = 0 for all 
j = 1,..., J. Figure 2 illustrates this generative model in the case where the piecewise-constant 
parameter 6jj represents the mean of the distribution of Xjj. and Xj.t is normally-distributed 
around this mean with fixed unit variance. Our primary goal in this model will be to infer the 
latent changepoint variables Z upon observing the data X. 

A key input to the model is the prior distribution ttq over [0,1], which controls how frequently 
changepoints occur and to what extent they co-occur across sequences. Rather than requiring the 
user to pre-specify this prior, Section 3.4 develops an empirical Bayes MCEM procedure to select 
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7 tq automatically. Specifically, we parametrize ttq as a mixture distribution 

nQ = ^2wkVk, (1) 

fceS 

where {vk}k£S is a fixed finite dictionary of probability distributions over [0,1] and {wk}keS are non¬ 
negative mixture weights summing to 1, and the MCEM maximum marginal likelihood procedure 
selects the weights {uifcjfces. In our applications, we will simply take the dictionary {vk\k&s to be 
discrete point masses over a fine grid of points in [0,1]. 

The choices of the likelihood model p(-\0) and the prior distribution 7r© are application-dependent. 
For our analysis of DNA copy number variations in Section 5, we use a normal model for p(-\0) where 
8 parametrizes the normal mean, and 7r© is the normal conjugate prior. For our analysis of stock 
return volatility in Section 6, we use a Laplace model for p(-\0) with mean 0 and scale parameter 
8, and 7r© is the inverse-Gamma conjugate prior. We provide details on these and several other 
common models in Appendix A. Our inference procedures are tractable whenever the marginal 

/ s— 1 

Y[p(X jir \6)ir e m (2) 

r=t 

may be computed quickly from Pj(t,s — 1) and Pj(t — l,s). This holds, in particular, whenever 
p(-\8) is an exponential family model with 7r© the conjugate prior, as Pj(t, s ) may be computed by 
updating a fixed number of sufficient statistics. Any unspecified hyperparameters of 7r© can also 
be selected automatically using the MCEM procedure of Section 3.4. 

We have assumed for notational convenience that each data sequence is generated from the same 
parametric family p{-\8) with the same prior 7r©. In applications where sequences represent different 
types of quantities, the choices of p{-\9) and 7r© should vary across sequences, and our posterior 
inference algorithms are easily extended to accommodate this setting. 

3. Inference procedures 

In this section, we give a high-level overview of our algorithms for posterior inference in the 
BASIC model, deferring details to Appendices B-D. Our primary task is to perform posterior 
inference of the unobserved latent changepoint variables Z , given the observed data X. Assuming 
7 tq and 7r© are fixed and known, Section 3.1 presents an MCMC procedure for sampling from 
the posterior distribution Pr(ZjX), and Section 3.2 presents an optimization algorithm to locally 
maximize this posterior distribution over Z to yield a MAP estimate. Section 3.4 presents an MCEM 
method to select ttq and 7r©, following the empirical Bayesian principle of maximum marginal 
likelihood. An efficient implementation of all inference algorithms is available on the authors’ 
websites. 

We emphasize that even though the BASIC model is specified hierarchically, our inference algo¬ 
rithms directly sample from and maximize over the posterior distribution of only Z, analytically 
marginalizing over the other latent variables q and 8. Furthermore, these procedures use dynamic 
programming subroutines that exactly sample from and maximize over the joint conditional distri¬ 
bution of many or all variables in a single row or column of Z, i.e. changepoints in a single sequence 
or at a single location across all sequences. We verify in Appendix E that this greatly improves 
mixing of the sampler over a naive Gibbs sampling scheme that individually samples each Zj.t from 
its univariate conditional distribution. 

3.1. Sampling from the posterior distribution. To sample from Pr(Z|X), we propose the 
following high-level MCMC procedure: 

(1) For j = 1,..., J: Re-sample Zj t . from Pr(Zj j .|X, Z^_j^ .) 

(2) For t = 2,... , T: Re-sample Z. t t from Pr(Z.y|A, Z. ^_ t ^) 
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(3) For b = 1,... ,B: Randomly select t such that Zy t = 1 for at least one j, choose s = t — 1 
or s = t + 1, and perform a Metropolis-Hastings step to swap Z.j and Z. }S . 

We treat the combination of steps 1-3 above as one complete iteration of our MCMC sampler. 
Here, Zy., Z^_j\., Z.y, and Z.y_ t ) respectively denote the j th row, all but the j th row, the t th 
column, and all but the t th column of Z. In step 3, B is the number of swap attempts, which we 
set in practice as B = 10 T. 

To sample Zy. \ Zr_j\. in step 1, we adapt the dynamic programming recursions developed in 
[Fearnhead, 2006] to our setting, which require 0(T 2 ) time for each j. To sample Z.y \ Z. j_ f ^ in 
step 2, we develop a novel dynamic programming recursion which performs this sampling in 0( J 2 ) 
time for each t. Step 3 is included to improve the positional accuracy of detected changepoints, 
and the swapping of columns of Z typically amounts to shifting all changepoints at position t to a 
new position t + 1 or t — 1 that previously had no changepoints. This step may be performed in 
0{JT) time (when B = O(T)), so one complete iteration of steps 1-3 may be performed in time 
0(JT 2 + J 2 T). Details of all three algorithmic procedures are provided in Appendix B. 

3.2. Maximizing the posterior distribution. To maximize Pr(Z|A') over Z, we similarly pro¬ 
pose iterating the following three high-level steps: 

(1) For j = 1,..., J: Maximize Pr(Z|A) over Zy.. 

(2) For t = 2,.. . ,T: Maximize Pr(Z|A) over Z.y. 

(3) For each t such that Zyt = 1 for at least one j, swap Z.y with Z.j-i or Z. : t+i if this increases 
Pr(Z|A), and repeat. 

We terminate the procedure when one iteration of all three steps leaves Z unchanged. In appli¬ 
cations, we first perform MCMC sampling to select ttq and ttq using the MCEM procedure to be 
described in Section 3.4, and then initialize Z in the above algorithm to a rounded average of the 
sampled values. Under this initialization, we find empirically that the above algorithm converges 
in very few iterations. 

To maximize Pr(Z|X) over Zj in step 1, we adapt the dynamic programming recursions devel¬ 
oped in [Jackson et al., 2005] to our setting, which require 0(T 2 ) time for each j. Maximization 
over Z. jf in step 2 is easy to perform in 0(J log J) time for each t. Step 3 is again included to 
improve the positional accuracy of detected changepoints, and after an O(JT) initialization, each 
swap of step 3 may be performed in 0(J ) time. Hence one complete iteration of steps 1-3 may be 
performed in time 0{JT\ogJ + JT 2 ). Details of all three algorithmic procedures are provided in 
Appendix C. 

3.3. Reduction to linear cost in T. In practice, T may be large, and it is desirable to improve 
upon the quadratic computational cost in T. For sampling, one may use the particle filter approach 
of [Fearnhead and Liu, 2007] in place of the exact sampling procedure in step 1, adding a Metropolis- 
Hastings rejection step in the particle-MCMC framework of [Andrieu et al., 2010] to correct for 
the approximation error. For maximization, one may use the PELT idea of [Killick et al., 2012] to 
prune the computation in step 1, with modifications for a position-dependent cost as described in 
[Fan et ah, 2015]. 

In our applications we adopt a simpler approach of dividing each row Zy. into contiguous blocks 
and sampling or maximizing over the blocks sequentially; details of this algorithmic modification 
are provided in Appendices B-C. This reduces the computational cost of one iteration of MCMC 
sampling to 0(J 2 T ) and of one iteration of posterior maximization to 0(JT log J), provided the 
block sizes are O(l). In all of our simulated and real data examples, we use a block size of 50 data 
points per sequence. We examine the effect of block size choice in Appendix E. 

3.4. Empirical Bayes selection of priors ttq and ttq. To select ttq and 7r© automatically using 
the empirical Bayes principle of maximum marginal likelihood, we assume ttq is a mixture as in 
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Eq. 1 over a fixed dictionary {v k }, and we estimate the weights {w k }. We also assume that 7r© is 
parametrized by a low-dimensional parameter r], and we estimate r/. We denote Pj (f, s ) in Eq. 2 
by Pj(t,s\r]). 

Let S{Zj ) .) denote the data segments {(1, ), (ii,^ 2)1 ■ • •, (t k ,T + 1)} induced by changepoints 

Zj',. i.e., Zj itl = ... = Zj t t k = 1 and Zj :t = 0 for all other t. Let Ni = #{t > 2 : J2j =1 %j,t = 0 be 
the total number of positions where exactly l sequences have a changepoint. Our MCEM approach 
to maximizing the marginal likelihood over candidate priors operates on the “complete” marginal 
log-likelihood, 


logPr(X, Z\{w k }, rj) 

= log Pr(X\Z, r]) + logPr(Z\{w k }) 

= lo s s \v) J + N i log fj ] w k f ^(1 - q) J ~ l Vk(dq) \ ■ 

\j =l (M)eS(Zj,.) J 1=0 \k£S J / 

Starting with the initializations and EM iteratively computes the expected complete 

marginal log-likelihood (E-step) 

= ^‘z\x,{w^~ 1) },ip~ 1 ') ^og Pr(^> z \i w k}, 1 ?)] 

and maximizes this quantity to select new prior estimates (M-step) 

= argmax {wfc} r? ({w k }, rj). 

MCEM approximates the E-step by a Monte Carlo sample average, 

1 M 

E z|^- 1 )}^-i)[ 1 °g Pr (^^l{^}^)] ~ !ogPr(A,Z (m )| {vJ k },r]), 

m=l 


where Z^\ ..., Z^ M ^ are MCMC samples under the prior estimates {my P } and 1 - 1 . Maximiza¬ 
tion over {wifc} and r] are decoupled in the M-step: 

M J / 

= argmax{^ fc } ^ J2 rf™* log ( Wk 

m= 1 1=0 \k£S 

M J 

r)^ = argmax^ EE E log Pj{t,s\r]), 

m=l j= 1 ( tiS )g5(zj” 1 ')) 

where Ay™'' 1 = jj{t > 2 : Jjj-i zj^ = l}. Maximization over {w k } is convex, and we use a tailored 
KL-divergence-minimization algorithm for this purpose. We use a generic optimization routine to 
maximize over the low-dinrensional parameter 77 . In our applications, we take {v k }k£S to be point 
masses at a grid of points with spacing 1 / J and spanning the range [0, J/ 2), and we initialize {u^} 
to assign large weight at 0 and spread the remaining weight uniformly over the other grid points. 
We initialize r ^ by dividing the data sequences into blocks and matching moments. Details of the 
optimization and initialization procedures are given in Appendix D. 



4. Simulation studies 

4.1. Assessing inference on a small example. We first illustrate our inference procedures on 
the small data example shown in Figure 2, with J = 9 sequences and T = 100 data points per 
sequence. This data was generated according to the BASIC model (with 6 := (fi,cr 2 ), p(-\0) = 
Normal^, a 2 ), ttq given by p ~ Normal(0, 5) and a 2 = 1, and ttq = 0.9do + O.lc^/g)- 
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Figure 3. BASIC posterior inference on data generated from the BASIC model 
(see Section 4.1). Heatmaps (a-c) display the marginal posterior probabilities of 
change Pr (Zjj = 1|X) estimated by MCMC using (a) the true data-generating 
priors ttq and ttq (which in practice are unknown), (b) grossly incorrect priors, 
and (c) MCEM-selected priors. The MCEM procedure in (c) is initialized with the 
incorrect priors of (b) but recovers accuracy comparable to the idealized setting 
in (a). Under the MCEM priors of (c), panel (d) displays the MAP changepoint 
estimate in red and the true changepoints as black crosses. 

Figure 3 shows the effectiveness of the empirical Bayesian MCEM approach to inference in this 
setting. Panel (a) shows the marginal posterior changepoint probabilities Pr (Zj t = 1|X) computed 
with 50 MCMC samples after a 50-sample burn-in, in an idealized setting where the sampling is 
performed under the true priors ttq and 7r© that generated the data. The results of panel (a) 
represent an idealized gold standard, as “true priors” are typically unknown in practice. Panel (c) 
demonstrates, however, that performance comparable to the gold standard can be obtained using 
MCEM-selected priors, even when the MCEM algorithm is initialized with a grossly incorrect prior 
guess. In particular, panel (b) displays Pr (Zjj = 1\X) under the grossly incorrect prior choices 
fjL ~ A/"(0,10), a 2 = 10, and ttq = 0.2<5o + 0 .25i/g + 0.2d 2 /g + 0.2<5 3 / g + 0.2<5 4 / 9 , while panel (c) 
displays Pr (Zjj = 1|AT) when prior parameters are initialized to the same grossly incorrect choices 
and updated with an MCEM update after iterations 5, 10, 20, 30, and 50 of the burn-in. Notably, 
the posterior inferences using MCEM priors (panel (c)) are comparable to those of the idealized 
setting (panel (a)), despite this incorrect initialization. Finally, panel (d) shows the MAP estimate 
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Table 1. Errors averaged over 100 instances of the Section 4.1 simulation. Posterior 
inference using MCEM-selected priors recovers accuracy comparable to the idealized 
setting of using the true data-generating priors (“True priors”), even when initialized 
with grossly incorrect priors (“Wrong priors”). 




True priors 

Wrong priors 

MCEM priors 

Squared error of E{Z 

\X} 

8.1 

17.9 

8.3 

Squared error of E{# 

X} 

50.3 

151 

51.1 

0-1 changepoint error 

of Z MAP 

10.3 

14.9 

10.1 


of Z using the priors estimated in panel (c). In this example, the MAP estimate misses two true 
changepoints and makes two spurious detections. 

We repeated this simulation with 100 different data sets generated from the BASIC model. 
Table 1 summarizes results using three error measures (all averaged across the 100 experiments): 
the squared error of the posterior mean changepoint indicators JV f (K{Zj^ \ X} — Zj^ e ) 2 , the 
squared error of the posterior mean signal reconstruction I X} — 0^ e ) 2 , and the 0-1 

error of detected changepoints in the MAP estimate. All evaluation metrics indicate that posterior 
inference using the MCEM-selected prior consistently leads to accuracy comparable to the idealized 
gold standard of using the true data-generating prior. As a reference point for the difficulty of this 
simulated data, the average 0-1 changepoint error of applying a univariate changepoint method 
(PELT with default MBIC penalty in the “changepoint” R package, Killick et al. [2012]) to each 
data sequence individually is 12.6, which is 25% higher than that of our MAP estimate under the 
MCEM-selected prior. 

4.2. Comparing detection accuracy on artificial CNV data. The identification of copy num¬ 
ber variations (CNVs) in aCGH data for cancer cells represents one primary motivation for our 
work. As there is typically no known “gold standard” for the locations of all CNVs in real aCGH 
data, we will assess changepoint detection accuracy in a simulation study, applying our inference 
procedures to 50 simulated aCGH data sequences using the simulator from Louhimo et al. [2012] 1 . 
This simulator generates six CNVs that are either focal high-level (2-copy loss or 6-to-8-copy gain), 
focal medium-level (1-copy loss or 4-copy gain), or broad low-level (1-copy gain). The prevalence 
of each CNV across samples ranges between 5% and 50%. The simulator accounts for sample 
heterogeneity, with each sample corresponding to a random mixture of normal and abnormal cells. 

To apply BASIC, we performed 100 iterations of MCMC sampling after 100 iterations of burn- 
in, using a normal likelihood model with changing mean and fixed (unknown) variance, and with 
MCEM updates of prior parameters after iterations 10, 20, 40, 60, and 100 of the burn-in. We 
then performed MAP estimation using the resulting empirical Bayes priors, with Z initialized to 
the MCMC sample average. On this data, the BASIC MAP estimate achieved 100% accuracy; we 
report results in Appendix F. 

One way in which this synthetic data is easier than the real aCGH data we analyze in Section 
5 is that focal and broad CNVs span at least 50 and 500 probes, respectively, whereas they are 
shorter in our data of Section 5 and also in certain previous single-sample comparison studies [Lai 
et al., 2005]. To increase the difficulty in this regard, we subsampled every tenth point of each 
synthetic data sequence and analyzed the resulting sequences, in which focal CNVs span 5 probes 
and broad CNVs span 50. Results on this more challenging dataset are reported here. 

The accuracy of the BASIC MAP estimate is shown as the red star in Figure 4, where we 
plot the fraction of true changepoints discovered against the false-discovery proportion. Shown 

Mhis simulator also generates corresponding gene expression data; we ignored this additional data, as integration 
of these two data types is not the focus of our paper. 
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SIMPLE 

10.42 

CBS 

21.82 

cghseg 

29.23 

TVSp 

54.22 


Figure 4. Changepoint detection accuracy and signal reconstruction squared-error 
for various methods on simulated aCGH data from Louhimo et al. [2012] (see Sec¬ 
tion 4.2). Left: Fraction of true changepoints detected across all sequences, versus 
fraction of all changepoint detections that are false discoveries. Right: Total signal 
reconstruction squared-error, where is the estimated log 2 ratio at probe t in 
sequence j, and is its true value. For SIMPLE, we report the highest accuracy 
obtained across all values of its tuning parameter. 


also in Figure 4 are the results of several alternative methods: SIMPLE [Fan et al., 2015] to 
represent the penalized likelihood approach, TVSp [Zhou et al., 2013] to represent total-variation 
regularization, circular binary segmentation (CBS) [Olshen et al., 2004] applied separately to each 
sequence to represent a popular method of unpooled analysis, and cghseg [Picard et al., 2011] to 
represent a popular method of pooled analysis. We set the convergence tolerance of TVSp to 10" 14 
and ignored changes with mean shift less than 10 -3 to avoid identifying breakpoints because of 
numerical inaccuracy. We applied SIMPLE with a normal likelihood model; as the method does 
not prescribe a default value for the main tuning parameter, we plot its performance as the tuning 
parameter varies. All remaining parameters of the methods were set to their default values or 
selected using the provided cross-validation routines. 

Detection accuracy of the BASIC MAP estimate is near-perfect and competitive with the other 
tested methods—examination of its output reveals that it misses a focal (5-probe) medium-level 
loss in two sequences and a broad low-level gain in one sequence, and it makes one spurious 
segmentation in one sequence. Detection by cghseg is conservative, missing 10 focal gains and 
losses across all sequences. In addition, as cghseg does not attempt to identify changepoints at 
common sequential positions, it inaccurately identifies the location of 15 additional changepoints, 
which contributes both to an increased false discovery proportion and a reduced true discovery 
proportion. (This positional inaccuracy ranges between 1 and 5 probes.) Single-sequence CBS 
suffers from the same changepoint location inaccuracy. It is less conservative than cghseg, truly 
missing only 3 aberrations across all sequences, but also identifying 2 non-existent aberrations. 
TVSp partitions the data sequence into too many segments, yielding false-discovery proportion 
close to 1 for changepoint discovery. We do note that TVSp and its tuning-parameter selection 
procedure are designed to minimize the signal-reconstruction squared error, rather than changepoint 
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Figure 5. Comparison of methods by the total number and rate of detected change- 
points that are coincident across two technical replicates of real aCGH data for 59 
cancer cell lines (see Section 5). The performance of SIMPLE varies with its un¬ 
specified tuning parameter. 


identification error. However, we report the signal reconstruction errors alongside Figure 4 and 
observe that TVSp is also less accurate by this metric. 

SIMPLE yields performance close to that of BASIC under optimal tuning, but the authors of 
[Fan et al., 2015] provide little guidance on how to choose the tuning parameter. In the BASIC 
framework, the analogous hyperparameters of ttq are selected automatically by MCEM. 


5. Copy number aberrations in the NCI-60 cancer cell lines 

We applied our BASIC model to analyze CNVs in aCGH data for the NCI-60 cell lines, a set 
of 60 cancer cell lines derived from human tumors in a variety of tissues and organs, as reported 
in [Varma et ah, 2014]. We discarded measurements on the sex chromosomes, removed outlier 
measurements, and centered each sequence to have median 0; we discuss these preprocessing steps 
in Appendix G. We fit the BASIC model using a normal likelihood with changing mean and fixed 
variance, applying the same procedure as in Section 4.2. The runtime of our analysis on the pooled 
data (J = 125, T = 40217) was 2 hours. 

In this data, measurements for 59 of the 60 cell lines were made with at least two technical 
replicates. We used this to test the changepoint detection consistency of various methods, by 
constructing two data sets of 59 sequences corresponding to the two replicates and applying each 
method to the data sets independently. A detected changepoint is “coincident” across replicates if 
it is also detected in the same cell line at the same probe location in the other replicate. Figure 
5 plots the total number of coincident detections versus the fraction of all changepoint detections 
that are coincident, for the methods tested in Section 4.2. (We omit the comparison with TVSp due 
to its high false-discovery rate for changepoint identification.) BASIC has better performance than 
single-sample CBS, yielding more coincident detections also at a higher coincidence rate. BASIC 
is less conservative than cghseg, detecting more coincident changes but at a lower coincidence 
rate. Recall that the performance of SIMPLE varies with its unspecified tuning parameter. For 
comparable tunings of SIMPLE, BASIC yields slightly better performance: for the same level of 
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Figure 6 . Chromosome 1 aCGH measurements for four NCI-60 melanoma cell lines 
(black points) and associated BASIC estimates of marginal posterior changepoint 
probabilities using 100 MCMC samples (teal curves). Red dashed lines indicate 
BASIC MAP changepoint estimates. The estimated posterior mean of qt is displayed 
below in blue, providing a cross-sample summary of changepoint prevalence across 
all 125 analyzed sequences. 

changepoint coincidence across replicates, BASIC detects more changepoints, and for comparable 
numbers of detected changepoints, BASIC achieves a higher level of changepoint coincidence. 

We emphasize that a non-coincident detection is not necessarily wrong—for a changepoint de¬ 
marcating a low-level aberration against which a method does not have full detection power, a 
method may detect this change in one replicate but not the other. Conversely, a coincident detec¬ 
tion need not correspond to a true CNV, if technical artifacts are present in both replicates. The 
coincidence rate is not high for any tested method. Reasons for this include (1) changepoints due 
to technical drift, a common occurrence [Olshen et ah, 2004] which is particularly severe in some 
of the sequences of this data set; (2) probe artifacts that differ across replicates; and (3) low-level 
non-shared aberrations with boundary points that are difficult to precisely identify. The coinci¬ 
dence rate may be increased for all methods by applying post-processing procedures to remove 
changepoints due to technical drift and probe artifacts, although these procedures are usually ad 
hoc. 

Our BASIC framework provides not only a point estimate of changepoints, but also posterior 
probability estimates that may be valuable in interpreting results and also performing this type 
of post-processing. Figure 6 displays the log 2 -ratio measurements and the BASIC MAP estimate 
of changepoints in chromosome 1 for four distinct melanoma cell lines, alongside the estimated 
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marginal posterior changepoint probabilities. Figure 6 also displays the posterior mean estimate 
of qt (computed from the sampled Z matrices), which provides a cross-sample summary of the 
prevalence of shared changepoints across all analyzed sequences at each probe location. 

To illustrate one use of this posterior information, we performed a pooled analysis of all sequences 
(including all replicates to increase detection power and accuracy) in order to highlight genomic 
locations that contain focal and shared CNVs. First, we identified all pairs of genomic locations 
s and t on the same chromosome at distance less than 3 x 10 6 base pairs apart 2 such that at 
least two distinct cell lines had posterior probability greater than 90% of containing changepoints 
at both s and t. The interval between s and t is the identified CNV, and the sequences having 
posterior probability greater than 90% of change at s and t are the identified carriers of that CNV. 
To reduce false discoveries due to technical noise of the aCGH experiments, we restricted attention 
to those pairs for which this interval contained at least three microarray probes. Then, for each 
such pair, we computed the mean value of the data in the interval between s and t for the carrier 
sequences and compared this to the mean value in small intervals before s and after t. Figure 
7 shows the 20 identified CNVs that exhibit the greatest absolute difference between these mean 
values, displaying up to five distinct carriers of each CNV. CNVs that overlap in genomic position 
are grouped together in the figure. 

Many of the CNVs highlighted in Figure 7 contain genes that have been previously studied in 
relation to cancer; we have annotated the figure with some of these gene names. CDKN2A and 
CDKN2B are well-known tumor suppressor genes whose deletion and mutation have been observed 
across many cancer types [Kamb et ah, 1994, Nobori, 1994]. FBXW7 is a known tumor suppressor 
gene that plays a role in cellular division [Akhoondi et al., 2007]. MYC is a well-known oncogene 
that is commonly amplified in many cancers [Dang, 2012], URI1 is a known oncogene in ovarian 
cancer [Theurillat et ah, 2011]. FAF1 is believed to be a tumor suppressor gene involved in the 
regulation of apoptosis [Menges et ah, 2009]. Deletion of A2BP1 has been previously observed 
in colon cancer tumors and gastric cancer cell lines [Trautmann et ah, 2006, Tada et ah, 2010]. 
Deletion of APOBEC3 has been observed in breast cancer [Long et ah, 2013, Xuan et ah, 2013], 
although we detect its deletion in cell lines of cancers of the central nervous system and the lung. 
Deletion of CFHR3 and CFHR1 is not specifically linked to cancer, but it is a common haplotype 
that has been observed in many healthy individuals [Hughes et ah, 2006]. Many of the remaining 
CNVs in Figure 7 appear to represent true copy number variations present in the data (rather 
than spurious detections by our algorithm), but we could not validate the genes present in the 
corresponding genomic regions against the cancer genomics literature. 

6. Price volatility in S&P 500 stocks 

As a second example, we applied the BASIC model to analyze the volatility in returns of U.S. 
stocks from the year 2000 to 2009. We collected from Yahoo Finance the daily adjusted closing 
prices of stocks that were in the S&P 500 index fund over the entire duration of this 10-year period, 
and we computed the daily return of each stock on each trading day t as ( pt — Vt-i)IPt- \ , where 
Pt is its closing price on day t and pt-i is its closing price on the previous day. Our data consists 
of the returns for J = 401 stocks over T = 2514 trading days, and the total runtime of our pooled 
analysis was 1 hour. 

Previous authors have applied univariate changepoint detection methods to analyze daily returns 
of the Dow Jones Industrial Index from 1970 to 1972, modeling the data as normally distributed with 
zero mean and piecewise constant variance [Hsu, 1977, Adams and MacKay, 2007]. We observed 
empirically for our data that the tails of the distribution of daily returns are heavier than normal, 
and we instead applied BASIC using a Laplace likelihood with fixed zero mean and piecewise 
constant scale. We used the same MCMC/MCEM/MAP inference procedure as in Section 4.2. 

p 

We use 3 million base pairs as the cut-off to distinguish focal from non-focal CNVs. 
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Figure 7. The 20 most prominent focal CNVs present in at least two of the NCI-60 
cancer cell lines. Genes of interest in the aberrant regions are highlighted in red. 
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Figure 8 . Daily returns of four U.S. stocks from 2000 to 2009, with MAP change- 
point estimates (from a joint analysis of 401 stocks) shown in dashed red and model- 
based volatility estimates shown in solid red. The estimated posterior mean of qt is 
displayed below in blue. 

Shown in Figure 8 are the daily returns for American International Group Inc. (AIG), Aon Corp. 
(AON), Bank of America Corp. (BAC), and The Bank of New York Mellon Corp. (BK), together 
with MAP changepoint estimates and estimated marginal posterior change probabilities. Shown 
also is the cross-sample changepoint summary provided by the posterior mean of qt . Within this 
10-year period, the 15 trading days with the highest posterior mean for qt are, in chronological 
order: Sep 6 2001, Sep 17 2001, Jun 27 2002, Jul 1 2002, Aug 9 2002, Sept 24 2002, Nov 29 2002, 
Jul 24 2007, Aug 20 2007, Sep 15 2008, Sep 29 2008, Dec 9 2008, Jun 2 2009, Jun 3 2009, and Nov 
10 2009. The changepoints from 2001 to 2002 are attributable to the collapse of the dot-com bubble 
of the late 1990s and early 2000s, and those from 2007 to 2009 are attributable to the U.S. financial 
crisis. Several of these dates correspond to important events in U.S. stock market history, including 
Sep 17 2001 when the markets first re-opened after the World Trade Center terrorist attacks, Jul 
1 2002 when WorldCom stock fell in value by 93%, Sept 15 2008 when Lehman Brothers filed 
for Chapter 11 bankruptcy, and Sept 29 2008 when the U.S. House of Representatives rejected a 
proposed bailout plan for the financial crisis. 

Many other detected changepoints were local to small numbers of individual stocks. For instance, 
the changepoint detected on Oct 14 2004 and visible in the first two sequences of Figure 8 was shared 
across the seven stocks AIG, AON, Coventry Health Care, Hartford Financial Services, Marsh & 
McLennan, Merk &; Co., and Unum Group. Six of these seven stocks belong to the insurance 
industry, and the changepoint represents a brief spike in price volatility due to an insurance scandal 
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that was revealed on Oct 14 2004 when AIG publicly disclosed its involvement, along with Marsh 
Sz McLennan and others, in an illegal market division scheme, and civil and criminal charges were 
announced against Marsh & McLennan and employees at AIG pertaining to various allegations 
of corporate misbehavior. 3 Other examples of detected “locally-shared” changepoints include Oct 
10 2000, marking the beginning of a period of increased price volatility in the tech companies 
Amazon.com, Cisco Systems, EMC Corporation, JSD Uniphase, Oracle Corporation, and Yahoo! 
Inc.; and Feb 16 2005, coinciding with the date on which the international Kyoto Protocol treaty 
on carbon emissions took effect and marking the start of a period of increased price volatility in 
the energy companies Dominion Resources, Devon Energy, Public Service Enterprise Group, and 
Exxon Mobil. 

We may also use our methods to produce a smooth estimate of the historical volatility of stock 
prices, by computing the posterior mean of the Laplace scale parameter 9j.t. for each sequence j 
and each day t using the sampled Z matrices. The Laplace scale parameter 6j t t implies a standard 
deviation of x/2Qj jt ; red lines in Figure 8 are plotted at ±2 standard deviations to pictorially 
illustrate this volatility estimate. This estimate is smooth and resilient to outliers, while still 
exhibiting rapid adjustments to real structural changes in the data. 


Appendix A. Likelihood models 

For concreteness, we record here several practically-relevant choices of p(-\9) and n r© in the BASIC 
model, along with the corresponding computations for Pj(t,s ) in Eq. 2. In each of these settings, 
the prior distribution 7r© is parametric, and we denote the parameter of 7Te as r/. 


Normal model, changing mean and fixed variance: 

9 := (ju, cr 2 ), Xj <t \9 ~ Normal(/i, a 2 ) 

2 

V ■= (ho, A, <Tq), h\v ~ Normal(/r 0 , ^), cr 2 |? ? = cJq 

\ . .2 | 1 v2 (Vo+EJ=tL>) 2 

A ho + 2^r=t A j,r X+s—t 


Pj(t, s) = (2 tt al) "a 


A 

X + s — t 


exp 


2°o 


Normal model, changing variance and fixed mean: 

9 := (fj,,a 2 ), Xj t t\9 ~ Normal(/q cr 2 ) 

V := (ho, a,/3), a 2 \r] InverseGamma(a, /3), fi\r] = no 


Pj(t,s) = (2 tt) V ^ 


r(« + ¥) 


T(a) 


P + 


(s-*)Mq , Y^s-l 


X 2 


+ Yhr=t. fr ho Ylr=t Xj, 




Normal model, changing mean and variance: 

9 := (//, a 2 ), Xjj\9 ~ Normal(^t, a 2 ) 

V := (ho, A,a,/3), a 2 \ij InverseGamma(a, j3), hi® 2 , V ~ Normal (jio, 

r(« + ^) 




(a , ^1+Y.rJtXlr (AW+E^W.DA 

1 P ^ 2 2(A+s-i) J 


a +i 


( 3 ) 


( 4 ) 


( 5 ) 
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Source: “Just how rotten?”, The Economist, Special Report, 21 October 2004. 
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Poisson model, changing mean: 


9 := A, Xjj 1 9 ~ Poisson(A) 

iy.= (a,P), X\rj rs_/ Gamma(a, f3) 

/s -1 


= n 


x ir \ 

r=t J ’ r , 


r r (a + Er= t Xj,r) 

r(a) (^ -f l)«+Z)r=t 


Bernoulli model, changing success probability: 


9 := p, Xj )t \9 ~ Bernoulli (p) 
r):=(a,(3), p\r) ~ Beta(a, (3) 

r(g + 13) r (g + Er=t Xj,r) np + s-t - E s r zl x j>r ) 

’ r(a)r(/3) r(cH-/3 + s - i) 

Laplace model, changing scale and fixed zero mean: 


9 := u, rs_/ Laplace(0, u) 

rj:=(a,fi), v\p InverseGamma(a, /3) 

P j {t,s) = 2-( s ~ t ^ 


r(a + s — t) 


r(a) 


+ Ed l*i, 


a+s —t 


( 6 ) 


( 7 ) 


( 8 ) 


Appendix B. MCMC sampling algorithms 

Below are the details of the MCMC sampling steps discussed in Section 3.1. Throughout, we 
define the quantities 


f(k) = f q k (l -q) J - k 7T Q (dq), 

( 9 ) 

9 (k)= [ q k -\l-q) J - k n Q {dq ), 

( 10 ) 


for k = 0,...,</ in Eq. 9 and k = 1in Eq. 10. These quantities depend only on ttq and 
may be pre-computed outside of the sampling iterations. (If ttq is discrete or a mixture of Beta 
distributions, these quantities are easily computed analytically. Otherwise, these may be computed 
numerically for each k.) The computational costs of our MCMC sampling and MAP estimation 
procedures depend on ttq only via pre-computation of f(k) and g(k). 

Step 1: Gibbs sampling by rows 

To sample each row Zj t . conditional on the remaining rows we may employ the dy¬ 

namic programming recursions developed by Paul Fearnhead for the univariate changepoint prob¬ 
lem [Fearnhead, 2006], in the following manner. 

Let Nj(t) = (E/'=i ~ Zj,t denote the number of changepoints at position t in all but the 

j th sequence, and let Pr^ denote probability conditional on Z^_j^ ., with associated conditional 
expectation . Note that Nj ( t ) is deterministic under Pr-7*. Then the probability density function 
of qt conditional on is given, for each q E S, by 

Pr V\qt = q) oc Pr [Z { _ jU \q t = q) Pr {q t = q) = q N ^\l - q) J - N M~ 1 Pr( ft = q). 
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Letting Cj(t) := Pr= 1) = this implies that 


Cj(t) 


g(Nj(t) + !)• 


( 11 ) 


For each t > 1, let Qj(t) = Pr^ (Xjj : T\Zj_t = 1), and let Qj( 1) = Pr^pC^i^). Qj(t) is the joint 
probability density of the observed data in sequence j after and including position t, conditional 
on a changepoint having occurred in sequence j at position t and also conditional on the observed 
changepoints in all of the other sequences. Let Pj(t , s ) be as defined in Eq. 2. Then Qj(t) satisfies 
the following recursions, which are similar to those in Theorem 1 of [Fearnhead, 2006]: 


Qj(T) = Pr (j) (X jiT \Z jtT = 1) 

= Pj(T,T + 1), 

Qj(t) = \ Pl ^( z j,(t+i):{s—i) = 0) z j,s = 1| Zj,t = l) x 


( 12 ) 


k s=t -\-1 


Pl- (j) (Xj jt :T\Zj,t = 1) Z j,(t+ l):(s—1) = 0; Z j,s = l) ^ 

+ Pr^)(Z^( t+ i) :T = 0| Zj tt = 1) Pr*-^ {Xj tt:T \Zj t t = 1, Zj^ t+1 y T = 0) 

E f ri P^(Z j , r = 0)\pi^(Z jta = l)x 

s=t-\-l \r=t+l / 

Pl ' { X j,t:(s-l)\ Z j,t = 1, ^ 7 ,(t+l):(s-l) = 0, Z j,s = l) (Xj,s:T\ Z j,s = 1) ^ 

+ f II P r(j) ('^j,r = 0)^ Pr (Xjf :T \Z jtt = 1, Zj^ t+ iy T = 0) 


\r=t-\-l 
< s —1 


E IK 1 - c i( r )) c i( s ) P j(^ s )Qj( s ) + na- c i( r )) P j( t -- T + !)■ (I 3 ) 


\s=t- 1-1 \r=t -\-1 


\r=t +1 


Eq. 13 holds also for t = 1, by the same derivation. Eqs. 12 and 13 allow us to compute Qj(t) for 
t = T, T — 1,T — 2,..., 1 recursively via a “backward pass”. We may then sample each successive 
location where Zjj = 1, conditional on the data X and Zr_j\., in a “forward pass”: 

Pr (1) {Z jMt _ 1) =0,Z jit = l\X) 

= Pr ( P = 0, Zj jt = II-X^ijt) 

= Pr (j ' ) (A- J -, 1;T |Z J -, 1 : (t - 1) =0,Zj,t=l)PrOH^, 1; (t- 1) =0.^,t=l) 

I’r ;j; (.V. ; J;/ ) 

= Pr(Aj,i:(t-ul^,i:(t-i)=0,^,t = l) = Pr<j) ^>=0) PrO)(Z j|t = l) 

Prd) (Xj' 1:T ) 

_(i )<3 j (*) (n* =2 ( 1 — c j (»■))) c j (*) (-\ a\ 

Qj (i) ’ 

Pr^^(Zj i ( s+1 ).( t _ 1 ) = 0, Zj, t = l\Zj, a = 1,X, -Z , j ) i:( s _i)) 

= Pr^(Zj ! ( s+1 ).( t _ 1 ) = 0, = l|Z i)S = 1, Xj S T ) 

_ Prd)(AIj ja : T |Zj ]3 =l,Zj i ( s + 1): ( t _ 1 )=0.Zj !< = 1 ) Prd)(Zj i ( a+1 ).( f _ 1 )=0,Zj, t = l|Zj, s =l) 

Prd)(X,-, s:T |^>=l) 
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_ ^j'( S !^)Qi(^)(rTr-=s + l(l c j ( r ))) c j D z' T c; \ 

Qj(s) ■ 

To summarize, the procedure to sample Zj .\X. Z(_j\. is as follows: 

(1) For each t = 2,..., T, compute Cj(t ) according to Eq. 11. 

(2) (Backward pass) For each t = T,..., 1, compute Qj(t) according to Eqs. 12 and 13. 

(3) (Forward pass) Sample the smallest t for which Z h t = 1 according to Eq. 14. Sample each 
subsequent t for which Zj t t = 1 according to Eq. 15. 

Regarding computational cost, let us assume that Pj(t,s ) may be updated from Pj(t,s — 1) in 
constant time, as is true for all of the parametric models in Eqs. 3-8. Then computing the value 
of Cj(t ) for t = 2,..., T in step (1) above takes 0(T ) time. For step (2), the value of the summand 
for each s = t + 1,..., T in Eq. 13 may be updated from that for s — 1 in constant time, so each 
Qj(t) may be computed in 0(T ) time, and step (2) may be performed in 0(T 2 ) time. Finally, the 
value in the numerator of Eqs. 14 and 15 for each t = 2,..., T may be updated from that for t — 1 
in constant time, so step (3) may be performed in 0{T) time. Hence, sampling Zj } .\X, Z^_j^. for 
all sequences j = 1,..., J may be performed in 0(JT 2 ) time. 

We next describe the modification of this sampling algorithm to sample each row Zj in a block- 
wise fashion, by dividing each row Zj_. into K blocks Zj^.( tl _ x), ..., and Gibbs 

sampling the blocks sequentially. Let Tj{k) = max{r < tk '■ Z^ r = 1}, and let Sj(k) = min{s > 
4-i-i : Z, J)S = 1}, with the conventions rj{k) = 1 if = 0 and Sj(k ) = T + 1 if Zj )tk+1 -.T = 0. 

Let Pr^) denote probability conditional on Zj,t k+1 :T, and Z^_j^.. (Note then that 

rj{k ) and Sj(k ) are deterministic under Pr^’ fc ^.) Let Qj t k(t) = Pr ^’ k \^j,t:(sj(k)-l)\^j,t = 1) for 

tk<t< 4+1 - 1, and Qj.kitk - 1) = Pr (j ’ fe) {X jr j(k):(sj(k)~ i))- Then, in the backward pass, we may 
compute 


Qj,k(tk+1 1) — Pj(tk-\-l l,Sj(fe)), 

Afc+i—i / s —i \ 

Qj,k(t) = Y n ( 1_c i( r )) fo( s ) p i(M)<2j,fc(s) 


s=t -\-1 \r=t +1 
tk +1 — 1 

+ n t 1 ~ c i( r )) p i(^ s j( k )) for tk<t< 4+1 - 1, 

r=t +1 
tk+ l—l / s —1 


( tk+ l — l / s—1 \ 

y (n t 1 - foW)) c o( s ) p j(. r j( k )’ s )QjA s ) 

s=t k \r=t k J 

tk +1 — 1 

+ n c 1 - c j( r )) p j( r j( k ), s j( k ))> 


r=tu 


and sample each successive location where Zj jt = 1, for t £ {4, • ■ • , 4+1 — 1}) by 


pr^kw.)=o. zm =i a) = ^ |rilt), ‘ >g, 'tS:-i) |1 " Mr)>)c,<,> - 
= o-Zm = l \z M = 1 -+Z/,«,:(.-!)) = 


The derivations of these expressions are similar to those for Eqs. 12-15, and we omit them for 
brevity. 

The time required to sample each block of changepoint variables ^j,t fc :(t fc+1 -i) is 0((4+i — 4) 2 ), 
reducing the time required to sample all blocks of Zj,. to 0(T ) if the block sizes are 0(1). Then the 
total computational cost of sampling Zj t .\X, Z(_j\. for all sequences j = 1,..., J is reduced from 
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To sample each column Z.j conditional on the remaining columns let rt(j) and st(j ) 

denote the changepoints in the j th sequence immediately before and after time t, i.e., rt(j) = 
rnaxjr : r < t, Zj r = 1} and St(j) = min{s : s > t, Zj s = 1}, with the conventions rt(j ) = 1 if 
= 0 and st{j) = T + 1 if Zjt t+1 y T = 0. Let Pr^ denote probability conditional on Z./ t \ 
with associated conditional expectation E^. Note that rt(j ) and st(j) are deterministic under 
PrW. Let 

MJ) = = 1) = Pj(nU),t)Pj(t, s t (j )), (16) 

Bt{j) = = °) = Pj( r t(j), s t (j )) (17) 

for each j = 1where Pj(t,s ) is as defined in Eq. 2. For each j = 1and each 

k = 0,..., J—j, let Rt (j. k ) be the coefficient of x k y J ~^~ k in the polynomial Yli=j+i{At(i)x+B t (i)y), 

with the convention Rt(J, 0) = 1. We may compute all of the Rt(j,k) values recursively for 
j = J, J — 1,..., 1 in an “upward pass”: 


Rt(J, 0) = 1 (18) 

Bt(j)Rt(j + 1,0) k = 0 

Rt(j , k ) = < B t {j)R t (j + 1 ,k) + A t (j)Rt(j + 1 , k - 1 ) l<k<J-j-l (19) 

A t {j)R t {j+ l,J - j-l) k = J — j. 

Let N t (j) = l z i,t denote the number of changepoints at position t in sequences 1 to j — 1, 
with N t ( 1) = 0. Then 

PrP(q t = q\Z 1: (j_ 1 ) tt ,Xtf +1 yj t .) 

oc Pr (<) (X (j+1):Ji .|^ = g,Z 1 .( J -_ 1 ) jt )Pr (t) (Z 1: ( 7 -_ 1)it |g t = q)P^ t \q t = q) 

Pr (<) (W r |g t = q) J Pr(Z 1:(j -_ 1)jt |g t = q) Pr(g t = q) 

d=J+l / 

J 


- i=j +1 


(PrW(A' v |Z iit = 1, ft = g) Pr W (^, t = l|ft = q) 

+1 

+ Pr ( - t \X i! .\Z jtt = 0 ,q t = q)Pi- {t) (Z j:t = 0|ft = q)j j Pr(Z 1: (j_i) jf |g t = g) Pr(ft = q) 


<x j (^(i)g + 5 t (i)(l - g)) J g iVt( - ? )(l _ q) J 1 Pr(g t = g). 
P=i+i 


Letting c t (j) = Pr= l|Zp(j_i), t , V(j+i) : J,) = Xy+l):J,.], this implies 

f (nt j+ i(A(i)<! + - «))) « K - e)+1 (l - 

Ct(j') — ——-—- 

/ (nij+i(^i(*)? + - ^))) - q) j - l ~ Nt Wn Q (dq) 

= Efc=o {Pt(j,k) f g N *(j)+ k + 1 (l - q) J - Nt ^)- k - 1 TT Q {dq )) 

Efc=o ( R tU,k) f q Nt 0')+fe(l - g )J--Wt(j')-fc-i7rg(dg)) 
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Z'k =Q R t{j, k)f(N t (j) + k + l) 
£fc=o Mil k )s(N t (j) + k + 1) 


( 20 ) 


where /(•) and g(-) are as in Eqs. 9-10. We may then sequentially sample Zit, ..., Zj conditional 
on the data X and Z.r_ t \, in a “downward pass”: 



( 21 ) 




To summarize, the procedure to sample Z.^\Z./ t ^ is as follows: 

(1) For each j = 1,..., J, compute A t (j) and B t (j) according to Eqs. 16 and 17. 

(2) (Upward pass) For each j = J,... ,1 and k = 0,..., J — j, compute Rt(j, k ) according to 
Eqs. 18 and 19. 

(3) (Downward pass) For each j = 1,..., J, compute ct(j ) according to Eq. 20, and sample Zj t t 
according to Eq. 21. 

Regarding computational cost, computation of A t (j) and B t (j) for j = 1,...,«/ in step (1) 
requires 0( J) time if we compute the values of Pj (?’, t) and Pj (t, s ) by updating them from Pj (r,t— 1) 
and Pj(t — 1, s). In step (2), computation of Rt{j, k) for j = J, ..., 1 and k = 0,..., J — j may be 
performed in 0(J 2 ) time. In step (3), computation of ct(j ) for a single value of j may be performed 
in O(J) time, so step (3) may also be performed in 0(J 2 ) time. Hence, sampling Z.^X, Z. for 
all positions t = 2,..., T may be performed in 0( J 2 T) time. 

A computational shortcut is provided by noting that the sums in the numerator and denominator 
of Eq. 20 typically decay rapidly as k increases; this is theoretically justified by the fact that for 
each t and j, (R,t(j, ^))fc=o i s a log-concave sequence (being the coefficients of a real polynomial 
with real roots, see Theorem 2 of [Stanley, 1989]) and that the mode of this sequence occurs near 
k = 0 if most sequences do not provide evidence of a changepoint at position t. Hence in practice 
we truncate these sums in step (3) when the size of the summand falls below a small threshold, and 
we compute and store the values Rt(j,k ) in step (2) via lazy evaluation, only as they are needed 
in step (3). We observe empirically that this yields a very significant reduction in computational 
time and does not affect the results of posterior inference. 

Step 3: Swapping columns by Metropolis-Hastings 

Let Pj(t,s) be as defined in Eq. 2. The following describes a Metropolis-Hastings move that 
potentially swaps two adjacent columns of the changepoint variable matrix Z: 

(1) Let T = {t : ZU Z j)t > 0} be the set of positions where there is at least one changepoint. 
Select t uniformly at random from T, and set t' = t — 1 or t' = t + 1 randomly with 
probability \ each. If t = T, set t' = t — 1 with probability 1, and if t = 2, set t! = t + 1 
with probability 1. (Recall that in our notation, Z.j = 0 is fixed for t = 1.) 

(2) For each j = 1,..., J, if Zjj 7 ^ Zj >t >, let r{j) = max{r : r < (t A t'), Zj_ r = 1}, and let 
s(j ) = min{s : s > (t V t r ),Zj tS = 1}, with the conventions r(j ) = 1 if Zj.hpAt') = 0 and 
s(j ) = T + 1 if Zj^ tvt ,y T = 0. Compute 


p:= 


j : Zj,t— 15 Zj t r —0 


P j ( r (j),t)Pj(t, s(j)) 



n 


Pj(rU),t)Pj(t,s(j)) 

Pj{r(j),t') p j(t',s(j)y 
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(3) If J2j=i z j,t' > 0, or if (t,t') {(2,3), (3,2), (T - 1, T), (T, T - 1)}, then swap Z. it and 

Z.ji with probability min(p, 1). If Ylj=\ z j,t’ = 0 and E {(2,3 ),(T,T — 1)}, then 

swap Z. t t and Z. t j with probability min(|,l). Finally, if Ylj=iZj,t' = 0 and (f, t') E 
{(3, 2), (T — 1, T)}, then swap Z.j and Z.y with probability min(2p, 1). 

To see that this procedure keeps the posterior distribution invariant, let Z denote Z with columns 
t and if swapped. Note that under the BASIC model, Pr(Z) = Pr(Z). Then the quantity p 
computed in step (2) above is precisely 

_ Pr(X| Z) _ Pr (X,Z) _ Pi(ZjX) 

P ~ Pr(X| Z) ~ Pr(X, Z) ~ Pr(Z\X )' 

The procedure of selecting (t, t') in step (1) induces a transition probability Z — >• Z, where Pr(Z —> 
Z) = Pr(Z —> Z) in most cases, with the exceptions Pr (Z —> Z) = and Pr (Z Z) = sL if 

J2j=i z j,t' = 0 and (t,t') = (2,3) or (T,T — 1), and Pr(Z —> Z) = and Pr(Z —> Z) = i if 

J2f=i z j,t' = 0 an d = (3,2) or (T — 1,T). Step (3) above handles all cases with the correct 
Metropolis-Hastings acceptance probability. In practice, the most common scenario is when there 
are no changepoints at position t' , in which case the “swap” of columns t and t! simply shifts all 
changepoints at position t by one position. 

Regarding computational cost, to perform the above procedure, one may precompute Pj(t,s ) 
for each sequence j and each pair of consecutive changepoints t,s in sequence j (i.e., Z, j t = 1, 
z j,(t+i):(s-i) = 0) an d Zj,s = !)• This requires 0{JT) computational cost. Then step (1) above 
requires 0(1) cost, step (2) requires O(J) cost, and step (3) requires O(J) cost. Upon performing 
the swap in step (3), the set T and the values Pj(t,s ) may easily be updated in O(J) time, to 
prepare for the next application of this Metropolis-Hastings move. Hence, performing B total 
iterations of the Metropolis-Hastings move requires 0(JT + JB ) time. In our applications we set 
B = 10T, and we observe that the computational cost of performing all B Metropolis-Hastings 
steps is much smaller than the cost of the row-wise and column-wise Gibbs sampling procedures. 

Appendix C. Posterior maximization algorithms 
B elow are the details of the iterative posterior maximization algorithm discussed in Section 3.2. 

Step 1: Maximizing over rows 

Note that Px{Z\X) = Pr(Z ? - ) .|X, Z^_^ .) Pr{Z^_^ .\X), so maximizing Pr(Z\X) over the row Zj t . 
is equivalent to maximizing Pr(Zj^.\X, Z/^ .). To perform this maximization, we may employ the 
dynamic programming recursions developed by Brad Jackson et al. for the univariate changepoint 
problem [Jackson et al., 2005], in the following way. 

Note that 

Pr (Zj,-\X, Z ( _ j}j .) = Pr(Z jt .\X jt .,Z ( _ j)t .) 

oc Pr(Xj } .\Zj t .) Pr(Z j) .|Z ( _ i)) .) 

T 

= Pr(Jfj,|Z A .) n (F ■‘{Zjt = l|Z(- 3) ,] z «(l - Pr[Zj, t = l|Z M) ,J)‘- z -) 

t=2 
T 

= I>r(.V,. Z,.) Cj (t) z ^( 1 - (22) 

t=2 

where c 3 (t) = Pr [Zjj = 1| Z(_j^.] may be computed as Eq. 11. Define Mj( 1) = Px{Xj t i\Zj^ = 1), 
the marginal probability density of the first data point in sequence j assuming there is a changepoint 
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immediately after it, and for t = 2,..., T, define 

t 

Vp{Z jA:t ) = 1V(.Y, I:/ Z jA:l . Z iit+1 = 1) H Cj (r) Zj ’ r (l - c j (r)) 1 ~ Zj ’ r , 

r =2 

Mj(t ) = ma xVjj(Zj t i :t ). 

Zj A:t 

Then Eq. 22 is exactly Vj^iZj^r), and we wish to compute the sequence Zj A: t that achieves the 
maximal value Mj(T). We do this by iteratively computing Mj(t ) for t = 1 ,,T. 

Let Rj(t. 1) = Vj } t({ 0, 0,..., 0)) be the value of Vjj if there are no changepoints before position 
t in sequence j, and for s = 2,..., t, let 


Rj(t,s)= max V jjt (Z jM ) 

— 1 i Z j,(s + l):t — U 

be the maximal value of Vjt assuming that the last changepoint in sequence j before position t 
occurs at position s. Then, with Pj(t,s ) as in Eq. 2, 

Mj(l) = (23) 

t 

Rj(t, 1) = Pj{l,t + 1)JJ(1 — c i( r ))’ ( 24 ) 


r=2 


R 


•(t,s)= max f Pr(X J - 1:(s _ 1 )|Z J - 1:(s _ 1) ,2 ijS = 1) TTcj(r) Zj ’ r (l - Cj(r )) 1 x 

V ' r =2 J 

t 

Pr(X ijS:t | Z jt3 = 1, Z jt ( s+ i) :t = 0, Z j)t+ 1 = l)cj(s) nd- c i( r )) 


r=s+l 


Mj(s - !)P j (s,t + 1)0^5) (l-Cj(r)), 


r=s+l 


Mj(t) = max Rj(t,s). 


(25) 

(26) 


The above recursions are similar to those in Section II of [Jackson et ah, 2005]. From these 
recursions, we may compute Mj(t) for each t = 2,..., T by computing Rj{t , s) for each s = 1,..., t. 
In the sequence Z ]A -t that achieves the maximum value Mj(T), the last changepoint is the index t 
such that Mj(T ) = Rj(T , t), the changepoint before t is the index s such that Mj[t— 1) = Rj(t— 1, s), 
etc. 

To summarize, the procedure to maximize Pr(2'j i .|Jf, Zr_j\.) over Zj r is as follows: 

(1) For each t = 2,..., T, compute Cj(t) according to Eq. 11. 

(2) Compute Mj( 1) according to Eq. 23. For each t = 2,..., T, compute Rj(t, s) for s = 1,..., t 
according to Eqs. 24 and 25, and then compute Mj(t ) according to Eq. 26. For each t, save 
the value of s such that Mj(t ) = Rj(t, s ). 

(3) Let S = {T + 1}. While the smallest value in S is greater than 1, let this smallest value be 
t, let s be the value that achieved Mj(t — 1) = Rj(t — 1,s), update S —>■ 5u{s}, and repeat. 
When the smallest value in S becomes 1, set Zj t = 1 for each t E S with 2 < t < T, and 
set Zjj = 0 for all other t. 

Regarding the computational cost, computation of Cj{t) for t = 2,..., T in step (1) above requires 
0(T) time. For step (2), Rj(t, 1) may be computed in 0{T ) time for each t, and Rj(t,s ) may be 
updated from Rj(t, s — 1) in constant time for each s = 2,..., t, so all of the values Rj(t, s ) and 
Mj(t ) for t = 2,..., T and s = 1,..., t in step (2) may be computed in 0(T 2 ) time. Since step (3) 
may be performed in 0(T ) time, maximizing Pr(Zj j .|X, .) over Z^. for all j = 1,..., J may 
be performed in 0(JT 2 ) time. 
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We next describe the modification of this maximization algorithm to maximize over each row Zj_. 
in a block-wise fashion, by dividing each row Z^. into K blocks Zj tl ,i t2 _ i), ..., Zj ^ t k _ v t) 

and maximizing over the blocks sequentially. Let rj(k) = max{r < tk ■ Zj yV = 1}, and let Sj(fc) = 
rnirijs > t^+i ■ Zj tS = 1}, with the conventions Vj(k) = 1 if = 0 and Sj(k) = T + 1 

if Zj, tk+i:T = 0. Then we may set Mjk(t k — 1) = Pj(rj(k),tk ) and compute recursively for t = 
4, • • • , ^k ■ l 1 and s tfc; • • • 1 1 


t _D = ) p j( r j( k )’ t+1 )n t r=t k ( 1 - c j( r )) t = 4, ■■■,4+1 - 2 
1 Pj (4 i k )i s j( k )) n^r 1 ( 1 - 4( r )) t = 4+1 - 1, 

>• (t si = / Mj ’ k ( S - + 1 ) c j( s ) nUi(! “ c i( r )) * = 4, • ■ • ,4+1 - 2 

- 1)-Pj(s, 3j(k))cj(s) n^+i 1 (1 - Cj(r)) t = 4 +1 - 1, 

Mjk(t)= max Rj(t,s). 

’ S=tv.~ 


The interpretations and derivations of the above expressions are similar to those for Eqs. 22-26, 
and we omit them for brevity. Then, initializing S = {ifc+i}, we may iteratively take the smallest 
value t in S, let s be such that Mj^(t — 1) = Rj^it — 1, s ), update S — > S U {s}, and repeat until 
s = tk — 1, to obtain -l) that maximizes the posterior probability over this block. 

The time required to maximize over each block Zj,t k :(t k+1 -l) is 0((4+i — 4) 2 ), reducing the time 
required to maximize over all blocks of Zj t . to 0{T) if the block sizes are 0(1). Then the total 
computational cost of maximizing over Zj,. for all sequences j = 1,..., J is reduced from 0( JT 2 ) 
to O(JT). 


Step 2: Maximizing over columns 


Note that Pr(Z|A) = Pr(Z. jt |X, Pr(Z. (_^|A"), so maximizing Pr(Z|A) over the col¬ 

umn Z.j is equivalent to maximizing Pr(Z. i j| A, Z. ,(_q)- To perform this maximization, let Nt = 
Ylj=i Zj.t denote the number of changepoints at position t. Note that Nt is a function of Z.j. Let 
jy (j ) and St (j ) denote the changepoints in the j th sequence immediately before and after position 
t, i.e., rt{j) = max{r : r < t. Z^ r = 1} and st(j) = min{s : s > t, Z ]yS = 1}, with the conventions 
r t (j) = 1 if Zj i: ( t-i) = 0 and s t (j) = T + 1 if Zj^ t +i):T = 0- Recall the quantities A t (j) and B t (j ) 
from Eqs. 16 and 17. Then 


Pr (Z., t \X, Z. )( _ t) ) oc Pr(A|Z) Pr (Z., t |Z. i( _ t) ) 


oc 


oc 


n II Bt (■?) Y Pr ( Z -Mt = q) Pr (% = q ) 


TT A t (j) 


\r-Zj,t =o 

W t ), 


q£S 


where f(k) is defined in Eq. 9. For any fixed Nt, the above quantity is maximized by setting 
Zjj = 1 for the N t indices j 6 {1,..., J} that correspond to the N t largest values of , and 
setting Zj t = 0 for all other j. Hence, to maximize Pr(Z.y|A, Z. ,(_t)) over Z. : t, we may perform 
the following procedure: 

(1) For each j = 1,..., J, compute Y(j) accor ding to Eqs. 16 and 17, and sort these values. 

(2) For each k = 0,..., J, compute the maximum value of j-Zj t =i RRp) /(&) over z -,t such 

that Zj t = k. Let k* be the value of k that maximizes this value. 
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(3) Set Zjf — 1 for the k* values of j corresponding to the k* largest values of and set 

Zj t = 0 for all other j. 

Regarding computation cost, may be computed for j = 1,..., J in step (1) in 0{J) time, 
if At(j) and B t (j) are updated from A t -\(j) and and they may be sorted in 0{J\ogJ) 

time. Step (2) may be performed in O(J) time. Since step (3) also may be performed in O(J) 
time, maximizing Pr(Z. i t|X, over Z.j for all t = 2, ... ,T may be performed in 0(JT log J) 

time. 

Step 3: Swapping columns 

The following procedure allows for adjustment of all changepoints at a position t to a new 
position t + 1 or t — 1: Let T = {t : J2j =i %j,t > 0} be the set of positions where there is at 
least one changepoint. For t 6 T, let Z + denote Z with columns t and t + 1 swapped, and let 
Z_ denote Z with columns t and t — 1 swapped. While there exists t E T such that Pr(X\Z) 
is less than Pr(X| Z + ) or Pr(A|Z_), update Z to Z + or Z_ accordingly, and repeat. Note that 
as Pr(Z\X) oc Pr(X|Z)Pr(Z) and Pr(Z + ) = Pr(Z_) = Pr (Z), the posterior probability Pt(Z\X) 
always increases with each swap. As in the case of our Metropolis-Hastings move in Section 3.1, 
the primary purpose of this routine is to swap column t for column t! = t + 1 or t' = t — 1 when 
^2j = i Zj f/ = 0, in which case the “swap” simply moves all changepoints at position t to t'. 

Regarding computational cost, one may precompute Pj(t,s ) for each sequence j and each pair 
of consecutive changepoints t,s in sequence j. This requires 0(J |7~j) computational time where 
|T| < T is the total number of positions with a changepoint in Z. Then it is evident that ^*x\z) 

and ^ fxz) ma y ' 3e computed in O(J) time from these quantities. Upon performing a swap of, 
say, t with t + 1, the new values Pj(t+ 1, s) and Pj(s, t+1) for changepoints s immediately preceding 
and following t + 1 may be computed in 0(J ) time, to prepare for evaluation of the next swap. 
Hence each swap throughout the procedure may be performed in 0(J ) time. In practice, we observe 
that very few swaps are made, and the total computational cost of column-swapping is dominated 
by the 0{J\T\) initialization time and is also negligible compared to the costs of row-wise and 
column-wise maximization over Z. 


Appendix D. MCEM algorithms 

We describe details of the maximization steps in our MCEM procedure. Maximization over ?y is 
dependent on the choices of the likelihood model p(x\6) and the prior model p(0\rj). In all of the 
examples of Eqs. 3-8, g is a low-dimensional parameter, and a closed-form expression is available for 
computing log Pj(t, s\p). We use the BOBYQA zeroth-order optimization routine [Powell, 2009], 
as implemented in the C++ dlib library, to maximize over 7], 

For the maximization over the probability weights {wk}ke s, observe that the objective function 
is a convex function of these weights. In fact, define a probability measure ji nQ on {0,..., J} by 

MttqO') = f - q) J ~ 3 u k {.dq), 

kes J ^' 

i.e. p-kqU) is the probability under ttq of observing exactly j changepoints at any position t. 

Denote by fi the distribution over {0,..., J} with mass function jl(j) = X^ m =i m(T- i) ■ (N°f e that 
^2j =0 Nj = T — 1 by dehnition of Nj, so J2j=o AO') = 1-) Then the cross entropy between fl and 
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H-kq is given by 

J J M N {m) / /• / r\ 

- X] A(j) log /-^Q O') = "EE log \^2 Wk ( M 1 “ q) J 3 Vk{dq) 

3=0 j=0m=l 1 ' \k£S J VJ/ 

As this cross entropy is equal to Dkl{U\\^-k Q ) + H{p), where Dkl{U\\^q) denotes the Kullback- 
Leibler divergence and H{p) denotes the Shannon entropy, this implies 



M J 


M(T- 1) 

v 7 m= 1 j = 


EE A r j'" j log | ^ f q J (l - q) J h 
m=ij=o Vfces ^ 


>) 


u k{dq) = -D KL (n\\n nQ ) + const. 


for a constant independent of 7 Tq. Hence the optimization over ttq may be written as 

{w^}keS = argmin {wfc} D KL {fi\\Hn Q )- 

This may be solved efficiently via an iterative divergence minimization procedure 


w 


(*) 


w 


(*—i) 


AtQ)/g J (l-g) J 3 Vk{dq) 

3=0 'Zk'&S w k'~ 1] 1 9 j (! - q) J ~ j v k '{dq) 


(27) 


(28) 


which converges to the global optimum in Eq. 27, provided that it is initialized to a probability 
vector supported on all of S [Csiszar and Shields, 2004, Lashkari and Golland, 2007]. To iteratively 
compute the update in Eq. 28, one may precompute f qj( 1 — q) J ~^Vk{dq) for each j and k. 

In our applications, we take {ufcjfceS = {&/</} \?lo^ * ’ anc l we initialize {wj. 0 ^} such that = 0.9 

and the remaining probability mass of 0.1 is spread equally over the other grid points k/J. We 
initialize r/ 0 ) by dividing the data in each sequence into blocks of 100 data points, computing the 
sample mean and/or variance within each block, and matching the empirical moments of these 
sample means and/or variances to their theoretical moments under the prior ttq. For instance, 
for the normal model with changing mean, Eq. 3, we initialize fiQ to the empirical average of 
the block means, Oq to the empirical average of the block variances, and A to <7 q divided by the 
empirical variance of the block means. A similar procedure is used for the other parametric models 
of Eqs. 4-8. 


Appendix E. Gibbs sampling comparisons 

We examine convergence to equilibrium of our MCMC sampling algorithm on a data set with 
J = 50 sequences and T = 10000 observations per sequence. We compare the performance of 
our algorithm with a naive Gibbs sampler and investigate also the effect of row block size in the 
accelerated version of our sampler. The data was generated according to the BASIC model with true 
changepoint prior ttq = 0.995<5o + 0.005do.4, using the likelihood of Eq. 3 with no = 0, A = 1, and 
Uq = 1. The generated data contained 1018 total changepoints at 50 distinct sequential positions. 

We performed experiments in which we ran 200 iterations of the MCMC sampling procedure 
of Section 3.1. Prior parameters were initialized to default settings as discussed in Section D and 
updated with MCEM after sampling iterations 5, 10, 20, 30, and 50. Red lines in Figure 9 depict 
the error of the sampled changepoints at each iteration, averaged across 50 independent replicates 
of this experiment, with error bars depicting ±2 standard deviations. Panel (a) displays the relative 
changepoint error, which is the total 0-1 error of changepoint detections, divided by 1018 (the total 
number of true changepoints). Panel (b) displays the relative change position error, which is the 
0-1 error of detected sequential positions having a changepoint in any sequence, divided by 50 (the 
total number of true sequential positions having such a change). As a comparison, the dashed 
green curve in Figure 9 shows the errors when each sequence is treated individually as its own 
data set and indicates the accuracy of an analogous analysis that does not pool information across 
sequences. 
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Figure 9. Relative changepoint error (a) and change position error (b) of alterna¬ 
tive MCMC inference procedures applied to data generated from the BASIC model. 
Also plotted is the aggregated error from one run of an analysis of each sequence 
individually. 


Dashed blue curves and error bars in Figure 9 correspond to the results of applying a naive 
Gibbs sampling algorithm to sample from the posterior distribution under the BASIC model. In 
this naive sampler, the latent variables qt and 9j$ are still marginalized out analytically, but the 
latent changepoint variables Zj t are individually Gibbs-sampled. This sampling scheme is easy 
to implement and does not require the dynamic programming recursions detailed in Section B. 
To equate runtime with that of our MCMC procedure, 30 iterations of naive Gibbs sampling are 
treated as “one iteration” in Figure 9. We observe that even though many iterations of naive Gibbs 
sampling can be performed in the same amount of time as one iteration of our procedure, the naive 
Gibbs sampler did not consistently converge to the same level of error. 

Black and cyan curves in Figure 9 show errors from a single experiment of our MCMC sampler 
and the naive Gibbs sampler, respectively, initialized to the true changepoint matrix Z true and using 
the true priors ttq and ttq. Both curves remain stable around the same “equilibrium” error value 
across all 200 iterations, providing evidence that the our sampler without this ideal initialization 
(red curve) indeed reaches equilibrium sampling of the posterior distribution after few iterations. 

In the above comparisons, our MCMC sampler was run with the default setting of row block 
size 50 in the acceleration described in Section 3.3. Figure 10 explores the effect of this block size 
choice on sampling: We tested block sizes in powers of two between 1 and 1024, and the curves 
correspond to the mean error across 50 independent experiments for the same two error metrics. 
(The sampler with block size 1 is different from the naive Gibbs sampler above, as we still apply 
the column-wise Gibbs sampling and Metropolis-Hastings column swap steps of our procedure.) In 
this example, the average spacing between changepoints is 200 across all sequences and 500 in any 
particular sequence. We observe that there is only a small improvement in sampling if block sizes 
are increased beyond 64; however, there is a large increase in computational time per iteration. On 
the other hand, reducing the block size to be very small does not yield a substantial reduction in 
computational time, if the column-wise sampling step is still applied in each iteration. We believe 
our default choice of block size 50 is a reasonable setting in most applications. 
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Figure 10. Effects of row block size choice on sampling. Relative changepoint error 
and change position error are as in Figure 9. 
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Figure 11. Changepoint detection accuracy and signal reconstruction squared- 
error for various methods on aCGH data as simulated in Louhimo et al. [2012], 
without subsampling. 

Appendix F. Comparison of methods on data of Louhimo et al. [2012] without 

subsampling 

Figure 11 reports comparisons of changepoint detection and signal reconstruction accuracy for 
various methods on the original data generated by the aCGH simulator of Louhimo et al. [2012]; 
results for data obtained by subsampling every 10th point of each sequence were reported in Section 
4.2. 
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Appendix G. Preprocessing details for CNV analysis of the NCI-60 cell lines 

Our analyzed data corresponds to measurements of the log 2 -intensity-ratio for the NCI-60 cell 
lines made using the Agilent human genome CGH oligonucleotide microarray 44B (GEO accession 
GPL11068), as reported in [Varma et ah, 2014] and publicly available at http://www.ncbi.nlm. 
nih. gov/geo/query/acc. cgi?acc=GSE48568. We discarded data for the PR:DU145(ATCC) and 
PR:RC01 cell lines which were not part of the original NCI-60 DTP cell line screen, yielding 125 
sequences corresponding to 60 distinct cell lines. We mapped microarray probe IDs to genomic 
locations using the annotation file available at the Agilent website http://www. chem. agilent. 
com/cag/bsp/gene_lists.asp. 

As the samples do not correspond to the same gender, we discarded measurements on the sex 
chromosomes. We observed a sizeable mean-shift of the entire data sequence between replicate 
measurements of the same cell line, and hence median-centered each sequence at 0. 

The measurements of certain individual probes corresponded to large outliers in the data se¬ 
quences, with the outlier value being significantly higher in some sequences and significantly lower 
in others. We believe such measurements are likely due to technical noise in the Agilent oligonu¬ 
cleotide platform, as previously noted in Olshen et al. [2004] and Nowak et al. [2007]. We applied an 
outlier removal procedure similar to that in Olshen et al. [2004]: For each sequence, we computed 
a median-absolute-deviation estimate of the noise level a. For each location t, if the data value at 
t was the maximum or minimum in the window from t — 3 to t + 3, and if the difference between 
its value and the closest other value in this window exceeded 2a, then we replaced the value at t 
with the median over this window. 
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